MissForest - non-parametric missing value imputation for mixed-type data

Authors

Daniel J. Stekhoven

Peter Bühlmann

Abstract

MOTIVATION Modern data acquisition based on high-throughput technology often faces the problem of missing data. Algorithms commonly used in the analysis of such large-scale data often depend on a complete dataset. Missing value imputation offers a solution to this problem. However, the majority of available imputation methods are restricted to one type of variable only: continuous or categorical. For mixed-type data, the different types are usually handled separately, so these methods ignore possible relations between variable types. We propose a non-parametric method which can cope with different types of variables simultaneously.

RESULTS We compare several state-of-the-art methods for the imputation of missing values. We propose and evaluate an iterative imputation method (missForest) based on a random forest. By averaging over many unpruned classification or regression trees, random forest intrinsically constitutes a multiple imputation scheme. Using the built-in out-of-bag error estimates of random forest, we are able to estimate the imputation error without the need for a test set. Evaluation is performed on multiple datasets coming from a diverse selection of biological fields, with artificially introduced missing values ranging from 10% to 30%. We show that missForest can successfully handle missing values, particularly in datasets that include different types of variables. In our comparative study, missForest outperforms other imputation methods, especially in data settings where complex interactions and non-linear relations are suspected. The out-of-bag imputation error estimates of missForest prove to be adequate in all settings. Additionally, missForest exhibits attractive computational efficiency and can cope with high-dimensional data.

AVAILABILITY The package missForest is freely available from http://stat.ethz.ch/CRAN/.

CONTACT [email protected]; [email protected]
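The iterative scheme described in the abstract can be illustrated with a toy, pure-Python sketch. This is not the missForest package (which is written in R) and makes several assumptions that the abstract does not state: missing entries are encoded as `None`, the initial guess is the column mean or mode, columns are visited from fewest to most missing values, a fixed number of passes (`n_iter`) replaces any data-driven stopping rule, and a small bagged ensemble of shallow trees stands in for a full random forest. The out-of-bag error estimate is omitted. All function names (`miss_forest`, `forest_predict`, `grow`, and so on) are hypothetical.

```python
import random
from collections import Counter

def is_num(v):
    return isinstance(v, (int, float))

def aggregate(ys):
    """Mean for numeric values, most frequent level for categorical ones."""
    if is_num(ys[0]):
        return sum(ys) / len(ys)
    return Counter(ys).most_common(1)[0][0]

def impurity(ys):
    """Sum of squared deviations (numeric) or n * Gini impurity (categorical)."""
    if is_num(ys[0]):
        m = sum(ys) / len(ys)
        return sum((y - m) ** 2 for y in ys)
    return len(ys) - sum(c * c for c in Counter(ys).values()) / len(ys)

def goes_left(row, j, v):
    # Numeric predictors split on a threshold, categorical ones on equality.
    return row[j] <= v if is_num(v) else row[j] == v

def grow(X, y, rng, depth=0, max_depth=4, min_size=2):
    """Grow one tree; a leaf is a plain value, an inner node is a tuple."""
    if depth >= max_depth or len(set(y)) == 1:
        return aggregate(y)
    p = len(X[0])
    best = None
    # Random feature subset per split (sqrt(p) for both target types here).
    for j in rng.sample(range(p), max(1, int(p ** 0.5))):
        for v in sorted(set(r[j] for r in X)):
            mask = [goes_left(r, j, v) for r in X]
            yl = [t for t, m in zip(y, mask) if m]
            yr = [t for t, m in zip(y, mask) if not m]
            if len(yl) < min_size or len(yr) < min_size:
                continue
            score = impurity(yl) + impurity(yr)
            if best is None or score < best[0]:
                best = (score, j, v, mask, yl, yr)
    if best is None:
        return aggregate(y)
    _, j, v, mask, yl, yr = best
    Xl = [r for r, m in zip(X, mask) if m]
    Xr = [r for r, m in zip(X, mask) if not m]
    return (j, v, grow(Xl, yl, rng, depth + 1, max_depth, min_size),
            grow(Xr, yr, rng, depth + 1, max_depth, min_size))

def predict(tree, row):
    while isinstance(tree, tuple):
        j, v, left, right = tree
        tree = left if goes_left(row, j, v) else right
    return tree

def forest_predict(X, y, X_new, rng, n_trees):
    """Fit bagged trees on (X, y) and aggregate their predictions for X_new."""
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(y)) for _ in y]  # bootstrap sample
        trees.append(grow([X[i] for i in idx], [y[i] for i in idx], rng))
    return [aggregate([predict(t, r) for t in trees]) for r in X_new]

def drop(row, j):
    return row[:j] + row[j + 1:]

def miss_forest(data, n_iter=3, n_trees=10, seed=0):
    """Impute None entries column by column; returns a new list of rows."""
    rng = random.Random(seed)
    n, p = len(data), len(data[0])
    miss = [[v is None for v in row] for row in data]
    X = [list(row) for row in data]
    for j in range(p):  # initial guess: column mean or mode (assumption)
        fill = aggregate([data[i][j] for i in range(n) if not miss[i][j]])
        for i in range(n):
            if miss[i][j]:
                X[i][j] = fill
    n_miss = [sum(m[j] for m in miss) for j in range(p)]
    order = sorted((j for j in range(p) if n_miss[j]), key=lambda j: n_miss[j])
    for _ in range(n_iter):  # fixed number of passes (assumption)
        for j in order:
            obs = [i for i in range(n) if not miss[i][j]]
            mis = [i for i in range(n) if miss[i][j]]
            preds = forest_predict([drop(X[i], j) for i in obs],
                                   [X[i][j] for i in obs],
                                   [drop(X[i], j) for i in mis], rng, n_trees)
            for i, v in zip(mis, preds):
                X[i][j] = v
    return X
```

The sketch expects each column to hold a single type (numbers or strings) with at least one observed value, and at least two columns overall. Because every variable is predicted from all the others regardless of type, continuous and categorical columns inform each other, which is the property the abstract emphasizes for mixed-type data.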

Similar resources

Comparison of imputation methods for missing laboratory data in medicine

OBJECTIVES Missing laboratory data is a common issue, but the optimal method of imputation of missing values has not been determined. The aims of our study were to compare the accuracy of four imputation methods for missing completely at random laboratory data and to compare the effect of the imputed values on the accuracy of two clinical predictive models. DESIGN Retrospective cohort analysi...

Random forest missing data algorithms

Random forest (RF) missing data algorithms are an attractive approach for imputing missing data. They have the desirable properties of being able to handle mixed types of missing data, they are adaptive to interactions and nonlinearity, and they have the potential to scale to big data settings. Currently there are many different RF imputation algorithms, but relatively little guidance about the...

Parametric fractional imputation for mixed models with nonignorable missing data

Inference in the presence of non-ignorable missing data is a widely encountered and difficult problem in statistics. Imputation is often used to facilitate parameter estimation, which allows one to use the complete sample estimators on the imputed data set. We develop a parametric fractional imputation (PFI) method proposed by Kim (2011), which simplifies the computation associated with the EM ...

Enhancing Iterative Non-Parametric Algorithm for Calculating Missing Values of Heterogeneous Datasets by Clustering

Machine learning and data mining rely heavily on large amounts of data to build learning models and make predictions, so data quality is critically important. Many industrial and research databases are plagued by the problem of missing values. A variety of methods have been developed with great success on dealing with missing values in dat...

Iterative Non-Parametric Method for Manipulating Missing Values of Heterogeneous Datasets by Clustering Fatigue and Corrosion Fatigue Behavior of Nickel Alloys in Saline Solutions

Machine learning and data mining rely heavily on large amounts of data to build learning models and make predictions, so data quality is critically important. Many industrial and research databases are plagued by the problem of missing values. A variety of methods have been developed with great success on dealing with missing values in da...

Journal:

Bioinformatics

Volume 28, Issue 1

Publication date: 2012
